Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return.
When the company receives a loan application, the company has to make a decision for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:
The dataset we have been given contains the information about past loan applicants and whether they ‘defaulted’ or not. The aim is to identify patterns which indicate if a person is likely to default, which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc.
In this case study, we will use Exploratory Data Analysis to understand how consumer attributes and loan attributes influence the tendency of default.
To use Exploratory Data Analysis (EDA) to understand how consumer attributes and loan attributes influence the tendency of default. From the insights identified, we will provide recommendations on what attributes can be considered to reduce loans that result in credit loss.
Load the given dataset
# Import header files
import numpy as np
import pandas as pd
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Load data
loan = pd.read_csv("loan.csv", low_memory=False)
#Get an overview of the structure of the data
loan.head()
| id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077501 | 1296599 | 5000 | 5000 | 4975.0 | 36 months | 10.65% | 162.87 | B | B2 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
| 1 | 1077430 | 1314167 | 2500 | 2500 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
| 2 | 1077175 | 1313524 | 2400 | 2400 | 2400.0 | 36 months | 15.96% | 84.33 | C | C5 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
| 3 | 1076863 | 1277178 | 10000 | 10000 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
| 4 | 1075358 | 1311748 | 3000 | 3000 | 3000.0 | 60 months | 12.69% | 67.79 | B | B5 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
5 rows × 111 columns
#Print number of rows in the dataset
print(len(loan))
39717
#Identify structure of dataset
loan.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 39717 entries, 0 to 39716 Columns: 111 entries, id to total_il_high_credit_limit dtypes: float64(74), int64(13), object(24) memory usage: 33.6+ MB
We have 39717 rows and 111 columns of data in the original dataset
Customer variables are those variables that are generated after a loan is approved. These metrics will not be used to determine the outcome of this case study. Therefore, these columns can be dropped.
cust_var_col = ['application_type',
'collection_recovery_fee',
'delinq_2yrs',
'earliest_cr_line',
'inq_last_6mths',
'last_credit_pull_d',
'last_pymnt_amnt',
'last_pymnt_d',
'open_acc',
'out_prncp',
'out_prncp_inv',
'pub_rec',
'recoveries',
'revol_bal',
'revol_util',
'total_acc',
'total_pymnt',
'total_pymnt_inv',
'total_rec_int',
'total_rec_late_fee',
'total_rec_prncp',
'url']
loan = loan.drop(cust_var_col, axis=1)
loan.shape
(39717, 89)
loan = loan.dropna(axis=1, how='all')
loan.shape
(39717, 35)
msno.bar(loan)
<Axes: >
Our aim is to remove those columns with more than 50% missing values. From the above chart, it is clear that the following columns have more than 50% missing values and therefore, can be removed.
col_to_drop = ['mths_since_last_delinq',
'mths_since_last_record',
'next_pymnt_d']
loan = loan.drop(col_to_drop, axis = 1)
loan.shape
(39717, 32)
By skimming through the dataset, it looks like some columns have just one value or a combination of NA and 0s. Such columns will not affect our analysis. Let's remove them after a proper analysis of number of unique values in each column.
loan.nunique()
id 39717 member_id 39717 loan_amnt 885 funded_amnt 1041 funded_amnt_inv 8205 term 2 int_rate 371 installment 15383 grade 7 sub_grade 35 emp_title 28820 emp_length 11 home_ownership 5 annual_inc 5318 verification_status 3 issue_d 55 loan_status 3 pymnt_plan 1 desc 26527 purpose 14 title 19615 zip_code 823 addr_state 50 dti 2868 initial_list_status 1 collections_12_mths_ex_med 1 policy_code 1 acc_now_delinq 1 chargeoff_within_12_mths 1 delinq_amnt 1 pub_rec_bankruptcies 3 tax_liens 1 dtype: int64
Here, we can directly drop columns that have just one value as these metrics will not affect the analysis
col_to_drop = ['pymnt_plan',
'initial_list_status',
'collections_12_mths_ex_med',
'policy_code',
'acc_now_delinq',
'chargeoff_within_12_mths',
'delinq_amnt',
'tax_liens'
]
loan = loan.drop(col_to_drop, axis = 1)
loan.shape
(39717, 24)
Free text columns cannot be used for analysis since they would lack a pattern required for analysis. So drop them.
col_to_be_dropped = ['desc','title']
loan = loan.drop(col_to_be_dropped, axis = 1)
loan.shape
(39717, 22)
There are 39717 rows. If every value in a particular column is unique, it does not add any perspective to our analysis. Therefore, such columns can be dropped.
#Refer to loan.nunique() in section D for identifying those columns
col_to_be_dropped = ['id','member_id']
loan = loan.drop(col_to_be_dropped, axis = 1)
loan.shape
(39717, 20)
With the columns that we have left for analysis, let us understand what each column represents
Let's take a look at the columns that still have null values and check their proportions.
loan.shape
(39717, 20)
# Identify columns with small percentage of missing data to determine imputation methods.
(loan.isnull().sum()/len(loan))*100
loan_amnt 0.000000 funded_amnt 0.000000 funded_amnt_inv 0.000000 term 0.000000 int_rate 0.000000 installment 0.000000 grade 0.000000 sub_grade 0.000000 emp_title 6.191303 emp_length 2.706650 home_ownership 0.000000 annual_inc 0.000000 verification_status 0.000000 issue_d 0.000000 loan_status 0.000000 purpose 0.000000 zip_code 0.000000 addr_state 0.000000 dti 0.000000 pub_rec_bankruptcies 1.754916 dtype: float64
We see a very small percentage of missing data for emp_title, emp_length and pub_rec_bankruptcies. Therefore, the imputation method determined for these rows is to drop these rows.
loan = loan.dropna()
loan.shape
(36548, 20)
loan.loan_status.unique()
array(['Charged Off', 'Fully Paid', 'Current'], dtype=object)
Loan Status takes three values:
The goal of our analysis is to identify the kind of potential borrowers who will default and therefore, will have to be charged off - Investors must not invest in them. In addition, this will also help understand borrowers who will pay off fully.
Therefore, for our analysis, borrowers who are currently paying off their loans (loan_status = Current) are not required. We will remove them from our dataset.
loan = loan[loan.loan_status != 'Current']
loan.shape
(35482, 20)
Now that we have cleaned up the data, we move on to Exploratory Data Analysis.
loan.head()
| loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | issue_d | loan_status | purpose | zip_code | addr_state | dti | pub_rec_bankruptcies | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2500 | 2500 | 2500.0 | 60 months | 15.27% | 59.83 | C | C4 | Ryder | < 1 year | RENT | 30000.0 | Source Verified | Dec-11 | Charged Off | car | 309xx | GA | 1.00 | 0.0 |
| 3 | 10000 | 10000 | 10000.0 | 36 months | 13.49% | 339.31 | C | C1 | AIR RESOURCES BOARD | 10+ years | RENT | 49200.0 | Source Verified | Dec-11 | Fully Paid | other | 917xx | CA | 20.00 | 0.0 |
| 5 | 5000 | 5000 | 5000.0 | 36 months | 7.90% | 156.46 | A | A4 | Veolia Transportaton | 3 years | RENT | 36000.0 | Source Verified | Dec-11 | Fully Paid | wedding | 852xx | AZ | 11.20 | 0.0 |
| 6 | 7000 | 7000 | 7000.0 | 60 months | 15.96% | 170.08 | C | C5 | Southern Star Photography | 8 years | RENT | 47004.0 | Not Verified | Dec-11 | Fully Paid | debt_consolidation | 280xx | NC | 23.51 | 0.0 |
| 7 | 3000 | 3000 | 3000.0 | 36 months | 18.64% | 109.43 | E | E1 | MKC Accounting | 9 years | RENT | 48000.0 | Source Verified | Dec-11 | Fully Paid | car | 900xx | CA | 5.35 | 0.0 |
# Identifying object types for the data
loan.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 35482 entries, 1 to 39680 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loan_amnt 35482 non-null int64 1 funded_amnt 35482 non-null int64 2 funded_amnt_inv 35482 non-null float64 3 term 35482 non-null object 4 int_rate 35482 non-null object 5 installment 35482 non-null float64 6 grade 35482 non-null object 7 sub_grade 35482 non-null object 8 emp_title 35482 non-null object 9 emp_length 35482 non-null object 10 home_ownership 35482 non-null object 11 annual_inc 35482 non-null float64 12 verification_status 35482 non-null object 13 issue_d 35482 non-null object 14 loan_status 35482 non-null object 15 purpose 35482 non-null object 16 zip_code 35482 non-null object 17 addr_state 35482 non-null object 18 dti 35482 non-null float64 19 pub_rec_bankruptcies 35482 non-null float64 dtypes: float64(5), int64(2), object(13) memory usage: 5.7+ MB
# Calculate the Percentage of Defaulters (Borrowers with Charged Off Loans) at Lending Club
loan_status_per = (loan.loan_status.value_counts(normalize=True))*100
loan_status_per
Fully Paid 85.8548 Charged Off 14.1452 Name: loan_status, dtype: float64
# Plot a Bar Chart to Visualize the Percentage of Defaulters at Lending Club
x_values = loan_status_per.index
y_values = loan_status_per.values
fig = go.Figure(data=go.Bar(x=x_values, y=y_values))
bar_width = 0.4
fig.update_traces(width=bar_width)
fig.update_layout(title='Loan Status Overview in Percentage',
xaxis_title='Loan Status',
yaxis_title='Percentage of Loans',
width=500,
height=500)
fig.show()
(loan.loan_amnt >= loan.funded_amnt).all()
True
(loan.funded_amnt >= loan.funded_amnt_inv).all()
True
# Plot a histogram to View the Spread of Loan Amounts
fig = go.Figure(data=go.Histogram(x=loan.loan_amnt))
fig.update_layout(title='Distribution of Loan Amount',
xaxis_title='Loan Amount',
yaxis_title='Frequency',
height = 700,
xaxis=dict(showgrid=True),
yaxis=dict(showgrid=True))
fig.show()
# Plot a box plot to view the spread of loan amounts Over Loan Status
fig = go.Figure(data=go.Box(x=loan.loan_status, y=loan.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Across Various Loan Status',
yaxis_title='Loan Amount',
xaxis_title='Loan Status',
height = 700,
width = 700
)
# Display the chart
fig.show()
# Loan Amount is a Numerical Variable. Creating Bins Helps With Better Analysis.
# Define the loan amount bins similar to the bins taken by the box plot graph
bins = [0, 5000, 10000, 15000, 20000, 25000, 30000, 35000]
# Create a new column 'loan_amnt_bin' to categorize loan amounts into bins
loan['loan_amnt_bin'] = pd.cut(loan['loan_amnt'], bins)
#Plotly cannot handle category datatype, so convert to str in a new column for plotting graphs
loan['loan_amnt_bin_str'] = loan['loan_amnt_bin'].astype(str)
#Category array will be used as a variable in plotly graphs since string variables tend to get sorted while plotting.
#Here, we are preserving the order
loan_amnt_cat_arr = ['(0, 5000]','(5000, 10000]','(10000, 15000]','(15000, 20000]',
'(20000, 25000]','(25000, 30000]','(30000, 35000]']
#Charged-off loans dataset
loan_npaid = loan[loan.loan_status =='Charged Off']
#Fully-paid loans dataset
loan_fpaid = loan[loan.loan_status =='Fully Paid']
# Plot a Graph: Percentage of Loans Charged Off or Fully Paid per Loan Amount Bin
loan_count = loan.groupby(['loan_amnt_bin_str', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] /
(loan_count['Charged Off'] +
loan_count['Fully Paid']))*100,2)
loan_count['fpaid_ratio'] = round((loan_count['Fully Paid'] /
(loan_count['Charged Off'] +
loan_count['Fully Paid']))*100,2)
fig = go.Figure()
# Add Bar trace for Charged Off Loans
fig.add_trace(go.Bar(
x=loan_count.loan_amnt_bin_str,
y=loan_count.npaid_ratio,
name='Charged-Off Loans',
offset=-0.2,
marker=dict(color='Red')
))
# Add Bar trace for Fully Paid Loans
fig.add_trace(go.Bar(
x=loan_count.loan_amnt_bin_str,
y=loan_count.fpaid_ratio,
name='Fully Paid Loans',
offset=0.2,
marker=dict(color='Green')
))
bar_width = 0.4
fig.update_traces(width=bar_width)
# Update the layout
fig.update_layout(
title='Loan Status Across Various Loan Amount Bins',
xaxis_title='Loan Amount Bins',
yaxis_title='Percentage of Loans',
xaxis=dict(categoryorder='array', categoryarray=loan_amnt_cat_arr),
barmode='group'
)
# Display the chart
fig.show()
If funded_amnt_inv is equal to loan_amnt, it implies that the investor had full confidence on the borrower. Let us analyze if the confidence of the investor has any impact on the percentage of charged off loans.
# Create a New Column to State if the Lender Had Full Confidence in the Borrower or Not
loan['conf'] = np.where(loan['loan_amnt'] == loan['funded_amnt_inv'], 'Yes', 'No')
#Charged-off loans dataset
loan_npaid = loan[loan.loan_status =='Charged Off']
#Fully-paid loans dataset
loan_fpaid = loan[loan.loan_status =='Fully Paid']
loan.conf.value_counts()
No 17816 Yes 17666 Name: conf, dtype: int64
# Plot a Graph to Show: Did the Confidence of the Investor Have Any Impact on the Repayment of Loans?
fig = go.Figure(data=go.Bar(x=loan.conf.value_counts().index,
y=loan.conf.value_counts().values))
bar_width = 0.4
fig.update_traces(width=bar_width)
fig.update_layout(title='Number of Loans Over Confidence of Investors',
xaxis_title='Did the Investor Have Full Confidence?',
yaxis_title='Number of Loans',
width=500,
height=500)
fig.show()
# Filter Out Loans Where Investor Had Full Confidence
conf_npaid = loan_npaid.conf.value_counts().sort_index()
conf_total = loan.conf.value_counts().sort_index()
conf_per = ( conf_npaid / conf_total )*100
conf_per
No 14.475752 Yes 13.811842 Name: conf, dtype: float64
# Plot a Graph to Show: Did the Confidence of the Investor Have Any Impact on the Repayment of Loans?
x_values = conf_per.index
y_values = conf_per.values
fig = go.Figure(data=go.Bar(x=x_values, y=y_values))
bar_width = 0.4
fig.update_traces(width=bar_width)
fig.update_layout(title='Percentage of Charged-Off Loans Over Confidence of Investors',
xaxis_title='Did the Investor Have Full Confidence?',
yaxis_title='Percentage of Defaulters',
width=500,
height=500)
fig.show()
# Plot a graph to display number of loans given in each loan term
term_total = loan.term.value_counts().sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=term_total.index,
y=term_total.values,
name='Loan Term',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Loans Over Loan Terms',
xaxis_title='Loan Term',
yaxis_title='Number of Loans'
)
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Across Both the Loan Terms
term_npaid = loan_npaid.term.value_counts().sort_index()
term_fpaid = loan_fpaid.term.value_counts().sort_index()
term_npaid_per = (term_npaid.values / term_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=term_total.index,
y=term_npaid_per,
name='Loan Term',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Debtors Over Loan Terms',
xaxis_title='Loan Term',
yaxis_title='Percentage of Charged Off Loans'
)
fig.show()
# Convert loan rate to float object
loan.int_rate = loan.int_rate.apply(lambda x: float( x[:-1]))
# Plot a Histogram to View the Spread of Interest rates
fig = go.Figure(data=go.Histogram(x=loan.int_rate, nbinsx=20))
fig.update_layout(title='Distribution of Interest Rate',
xaxis_title='Interest Rate',
yaxis_title='Frequency',
height = 500,
xaxis=dict(showgrid=True),
yaxis=dict(showgrid=True))
fig.show()
# Plot a Graph to Dispaly Spread of Interest Rate Over Loan Status
fig = go.Figure(data=go.Box(x=loan.loan_status, y=loan.int_rate))
fig.update_layout(title='Spread of Interest Rate Over Loan Status',
yaxis_title='Loan Interest Rate',
xaxis_title='Loan Status',
height = 700,
width = 700
)
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Across Interest Rates.
# Since Interest Rate is a Numeric Variable, we will bucket it first.
loan['int_rate_bin'] = round(loan.int_rate)
#Charged-off loans dataset
loan_npaid = loan[loan.loan_status =='Charged Off']
#Fully-paid loans dataset
loan_fpaid = loan[loan.loan_status =='Fully Paid']
int_rate_npaid = loan_npaid.int_rate_bin.value_counts().sort_index()
int_rate_total = loan.int_rate_bin.value_counts().sort_index()
int_rate_npaid_per = (int_rate_npaid.values / int_rate_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=int_rate_total.index,
y=int_rate_npaid_per,
name='Loan Term',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Debtors Over Loan Terms',
xaxis_title='Interest Rate',
yaxis_title='Percentage of Charged Off Loans',
)
fig.show()
# Plot a Histogram to View the Spread of Monthly Installments
fig = go.Figure(data=go.Histogram(x=loan.installment, nbinsx=20))
fig.update_layout(title='Distribution of Monthly Installments',
xaxis_title='Monthly Installments',
yaxis_title='Frequency',
height = 700,
xaxis=dict(showgrid=True),
yaxis=dict(showgrid=True))
fig.show()
# Plot a Graph to Dispaly Spread of Monthly Installment Amount Over Loan Status
fig = go.Figure(data=go.Box(x=loan.loan_status, y=loan.installment))
fig.update_layout(title='Spread of Monthly Installment Amount Over Loan Status',
yaxis_title='Monthly Installment Amount',
xaxis_title='Loan Status',
height = 700,
width = 700
)
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Across Monthly Installment Amounts.
# Since Monthly Installment is a Numeric Variable, we will bucket it first.
# Define the installment bins similar to the bins taken by the box plot graph
bins = [0, 200, 400, 600, 800, 1000, 1200, 1400]
# Create a new column 'loan_amnt_bin' to categorize loan amounts into bins
loan['installment_bin'] = pd.cut(loan['installment'], bins)
#Plotly cannot handle category datatype, so convert to str in a new column for plotting graphs
loan['installment_bin_str'] = loan['installment_bin'].astype(str)
#Category array will be used as a variable in plotly graphs since string variables tend to get sorted while plotting.
#Here, we are preserving the order
installment_cat_arr = ['(0, 200]','(200, 400]','(400, 600]','(600, 800]',
'(800, 1000]','(1000, 1200]','(1200, 1400]']
#Charged-off loans dataset
loan_npaid = loan[loan.loan_status =='Charged Off']
#Fully-paid loans dataset
loan_fpaid = loan[loan.loan_status =='Fully Paid']
# Plot a Graph: Percentage of Loans Charged Off or Fully Paid per Installment Amount Bin
loan_count = loan.groupby(['installment_bin_str', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] /
(loan_count['Charged Off'] +
loan_count['Fully Paid']))*100,2)
loan_count['fpaid_ratio'] = round((loan_count['Fully Paid'] /
(loan_count['Charged Off'] +
loan_count['Fully Paid']))*100,2)
fig = go.Figure()
fig.add_trace(go.Bar(
x=loan_count.installment_bin_str,
y=loan_count.npaid_ratio,
name='Charged-Off Loans',
offset=-0.2,
marker=dict(color='Red')
))
fig.add_trace(go.Bar(
x=loan_count.installment_bin_str,
y=loan_count.fpaid_ratio,
name='Fully Paid Loans',
offset=0.2,
marker=dict(color='Green')
))
bar_width = 0.4
fig.update_traces(width=bar_width)
fig.update_layout(
title='Loan Status Across Various Monthly Installment Amount Bins',
xaxis_title='Installment Amount Bins',
yaxis_title='Percentage of Loans',
xaxis=dict(categoryorder='array', categoryarray=installment_cat_arr),
barmode='group'
)
fig.show()
# Plot a graph to display number of loans given for each grade
grade_total = loan.grade.value_counts().sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=grade_total.index,
y=grade_total.values,
name='Grade of Loan',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Loans Over Loan Grade',
xaxis_title='Loan Grade',
yaxis_title='Number of Loans'
)
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Across Loan Grades
grade_npaid = loan_npaid.grade.value_counts().sort_index()
grade_fpaid = loan_fpaid.grade.value_counts().sort_index()
grade_npaid_per = (grade_npaid.values / grade_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=grade_total.index,
y=grade_npaid_per,
name='Loan Grade',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Debtors Over Loan Grades',
xaxis_title='Loan Grade',
yaxis_title='Percentage of Charged Off Loans'
)
fig.show()
# Plot a graph to display number of loans given for each sub-grade
sub_grade_total = loan.sub_grade.value_counts().sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=sub_grade_total.index,
y=sub_grade_total.values,
name='Sub Grade of Loan',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Loans Over Loan Sub Grade',
xaxis_title='Loan Sub Grade',
yaxis_title='Number of Loans'
)
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Across Loan Sub Grades
sub_grade_npaid = loan_npaid.sub_grade.value_counts().sort_index()
sub_grade_fpaid = loan_fpaid.sub_grade.value_counts().sort_index()
sub_grade_npaid_per = (sub_grade_npaid.values / sub_grade_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=sub_grade_total.index,
y=sub_grade_npaid_per,
name='Loan Sub Grade',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Debtors Over Loan Sub Grades',
xaxis_title='Loan Sub Grade',
yaxis_title='Percentage of Charged Off Loans'
)
fig.show()
#Calculate Number of Borrowers Across Various Employee Titles
emp_title_total = ((loan.emp_title.value_counts(normalize=True))*100)
emp_title_total
US Army 0.363565
Bank of America 0.298743
IBM 0.183191
AT&T 0.160645
Kaiser Permanente 0.155008
...
Togo's 0.002818
Fox Restaurant Concepts 0.002818
ieb local1 0.002818
city of new haven board of education 0.002818
Fairfield Manufacturing 0.002818
Name: emp_title, Length: 27506, dtype: float64
There are 27506 unique employee titles. Categorizing these values might only result in a very small percentage of borrower who can be grouped together.
Let us confirm with a visualization
# Plot a Graph to Display Number of Loans Per Employee Title
fig = go.Figure()
fig.add_trace(go.Bar(
x=emp_title_total.head(15).index,
y=emp_title_total.head(15).values,
name='Employee Title',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Overview of Borrowers Across Employee Titles',
xaxis_title='Employee Title',
yaxis_title='Percentage of Borrowers',
)
fig.show()
# Plot a Graph to Display Number of Loans Across Employee Tenure
emp_length_total = loan.emp_length.value_counts().sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=emp_length_total.index,
y=emp_length_total.values,
name='Emploee Tenure',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
emp_len_cat_arr = ['< 1 year','1 year','2 years','3 years','4 years','5 years',
'6 years','7 years','8 years','9 years','10+ years']
fig.update_layout(
title='Number of Loans Over Employee Tenure',
xaxis_title='Employee Tenure',
yaxis_title='Number of Loans',
xaxis=dict(categoryorder='array', categoryarray=emp_len_cat_arr)
)
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Across Employee Tenure
emp_length_npaid = loan_npaid.emp_length.value_counts().sort_index()
emp_length_fpaid = loan_fpaid.emp_length.value_counts().sort_index()
emp_length_npaid_per = (emp_length_npaid.values / emp_length_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=emp_length_total.index,
y=emp_length_npaid_per,
name='Employee tenure',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Debtors Over Employee Tenures',
xaxis_title='Employee Tenure',
yaxis_title='Percentage of Charged Off Loans',
xaxis=dict(categoryorder='array', categoryarray=emp_len_cat_arr)
)
fig.show()
# Plot Graph to Display Number of Loans Across Home Ownership Categories
home_ownership_total = loan.home_ownership.value_counts().sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=home_ownership_total.index,
y=home_ownership_total.values,
name='Home Ownership Status',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Borrowers In Various Home Ownership Status',
xaxis_title='Home Ownership Status',
yaxis_title='Number of Borrowers',
)
fig.show()
# Plot Graph to Display Percentage of Charged Off Loans Across Home Ownership Categories
home_ownership_npaid = loan_npaid.home_ownership.value_counts().sort_index()
home_ownership_npaid_per = (home_ownership_npaid.values / home_ownership_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=home_ownership_total.index,
y=home_ownership_npaid_per,
name='Home Ownership Status',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Debtors Over Home Ownership Status (Ignore Other Category)',
xaxis_title='Home Ownership Status',
yaxis_title='Percentage of Charged Off Loans',
)
fig.show()
# Plot Graph to Display Spread of Annual Income Over Loan Status, understand outliers
fig = go.Figure(data=go.Box(x=loan.loan_status, y=loan.annual_inc))
fig.update_layout(title='Spread of Annual Income Across Various Loan Status',
yaxis_title='Annual Income',
xaxis_title='Loan Status',
height = 1000,
width = 700
)
fig.show()
#Check percentile for annual income
print("85th percentile:",np.percentile(loan.annual_inc,85))
print("90th percentile:",np.percentile(loan.annual_inc,90))
print("95th percentile:",np.percentile(loan.annual_inc,95))
85th percentile: 100000.0 90th percentile: 115000.0 95th percentile: 140000.0
We see that 95% of people have a salary of 140k or below. We will consider income from minimum to 140k for our analysis.
# Applying filter in annual income
# Plot a Histogram to Display Spread of Annual Income Over Loan Status
fig = go.Figure(data=go.Histogram(x=loan.annual_inc[loan.annual_inc < 140000]))
fig.update_layout(title='Distribution of Annual Income',
xaxis_title='Annual Income',
yaxis_title='Frequency',
height = 700,
xaxis=dict(showgrid=True),
yaxis=dict(showgrid=True))
fig.show()
# Define the annual income bins similar to the bins taken by the box plot graph (majority is below 50000)
bins = [0, 10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000,
110000, 120000, 130000, 140000]
# Create a new column 'annual_inc_bin' to categorize annual income into bins
loan['annual_inc_bin'] = pd.cut(loan['annual_inc'], bins)
#Plotly cannot handle category datatype, so convert to str in a new column for plotting graphs
loan['annual_inc_bin_str'] = loan['annual_inc_bin'].astype(str)
#Category array will be used as a variable in plotly graphs since string variables tend to get sorted while plotting.
#Here, we are preserving the order
annual_inc_cat_arr = ['(0, 10000]','(10000, 20000]','(20000, 30000]','(30000, 40000]',
'(40000, 50000]','(50000, 60000]','(60000, 70000]','(70000, 80000]',
'(80000, 90000]','(90000, 100000]','(100000, 110000]','(110000, 120000]',
'(120000, 130000]','(130000, 140000]']
#Redefine charged off and fully paid dataset since new column is added
#Charged-off loans dataset
loan_npaid = loan[loan.loan_status =='Charged Off']
#Fully-paid loans dataset
loan_fpaid = loan[loan.loan_status =='Fully Paid']
#Charged-off loans dataset
loan_annual_inc = loan[loan.annual_inc <= 140000]
loan_annual_inc_npaid = loan_annual_inc[loan_annual_inc.loan_status =='Charged Off']
# Count how many loans were charged off or fully paid in each annual income bin
loan_annual_inc = loan[loan.annual_inc <= 140000]
loan_count = loan_annual_inc.groupby(['annual_inc_bin_str', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
loan_count['fpaid_ratio'] = round((loan_count['Fully Paid'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
fig = go.Figure()
fig.add_trace(go.Bar(
x=loan_count.annual_inc_bin_str,
y=loan_count.npaid_ratio,
name='Charged-Off Loans',
offset=-0.2,
marker=dict(color='Red')
))
fig.add_trace(go.Bar(
x=loan_count.annual_inc_bin_str,
y=loan_count.fpaid_ratio,
name='Fully Paid Loans',
offset=0.2,
marker=dict(color='Green')
))
bar_width = 0.4
fig.update_traces(width=bar_width)
fig.update_layout(
title='Loan Status Across Various annual income Bins',
xaxis_title='Annual income Bins',
yaxis_title='Percentage of Annual income',
xaxis=dict(categoryorder='array', categoryarray=annual_inc_cat_arr),
barmode='group',
width = 900,
height = 700
)
fig.show()
# Plot a Graph to Display the Number of Loans Across Verification Status
verification_status_total = loan.verification_status.value_counts().sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=verification_status_total.index,
y=verification_status_total.values,
name='Verification Status',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Loans Over Verification Status',
xaxis_title='Verification Status',
yaxis_title='Number of Loans'
)
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Over Verification Status
verification_status_npaid = loan_npaid.verification_status.value_counts().sort_index()
verification_status_fpaid = loan_fpaid.verification_status.value_counts().sort_index()
verification_status_npaid_per = (verification_status_npaid.values / verification_status_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=verification_status_total.index,
y=verification_status_npaid_per,
name='Verification Status',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Percentage of Charged Off Loans By Verification Status',
xaxis_title='Verification Status',
yaxis_title='Percentage of Charged Off Loans'
)
fig.show()
#Line Chart to be added
# Derive month column from issue_d for further analysis
loan.issue_d.value_counts().head()
Dec-11 1918 Nov-11 1909 Oct-11 1795 Sep-11 1786 Aug-11 1693 Name: issue_d, dtype: int64
loan['issue_month'] = loan.issue_d.apply(lambda x: x[:3])
#Charged-off loans dataset
loan_npaid = loan[loan.loan_status =='Charged Off']
#Fully-paid loans dataset
loan_fpaid = loan[loan.loan_status =='Fully Paid']
# Plot a Graph to Display the Number of Loans Taken Across Various Months
issue_month_total = loan.issue_month.value_counts().sort_index()
issue_month_cat_arr = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
fig = go.Figure()
fig.add_trace(go.Bar(
x=issue_month_total.index,
y=issue_month_total.values,
name='Month of Issue',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Loans Taken Over Various Months',
xaxis_title='Month of Issue',
yaxis_title='Number of Loans',
xaxis=dict(categoryorder='array', categoryarray=issue_month_cat_arr),
)
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Over Month of Issue
issue_month_npaid = loan_npaid.issue_month.value_counts().sort_index()
issue_month_npaid_per = (issue_month_npaid.values / issue_month_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=issue_month_total.index,
y=issue_month_npaid_per,
name='Month of Issue',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Percentage of Charged Off Loans By Month of Issue',
xaxis_title='Month of Issue',
yaxis_title='Percentage of Charged Off Loans',
xaxis=dict(categoryorder='array', categoryarray=issue_month_cat_arr),
)
fig.show()
# Plot a Graph to Display the Number of Loans Taken For Various Purposes
purpose_total = (loan.purpose.value_counts()).sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=purpose_total.index,
y=purpose_total.values,
name='Loan Purpose',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Loans Taken For Various Purposes',
xaxis_title='Loan Purpose',
yaxis_title='Number of Loans'
)
fig.show()
purpose_npaid = loan_npaid.purpose.value_counts().sort_index()
purpose_npaid_per = (purpose_npaid.values / purpose_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=purpose_total.index,
y=purpose_npaid_per,
name='Charged Off Loans',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Charged of Loans Across Various Loan Purposes',
xaxis_title='Purpose of Loan',
yaxis_title='Percentage of Charged Off Loans'
)
fig.show()
# Zip Code has 805 unique values
loan.zip_code.nunique()
805
loan.zip_code.head()
1 309xx 3 917xx 5 852xx 6 280xx 7 900xx Name: zip_code, dtype: object
loan.addr_state.nunique()
49
# Plot a Graph to Display the Number of Loans Taken in Various States
addr_state_total = (loan.addr_state.value_counts()).sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=addr_state_total.index,
y=addr_state_total.values,
name='State',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Loans Taken in Various States',
xaxis_title='State',
yaxis_title='Number of Loans'
)
fig.show()
addr_state_npaid = loan_npaid.addr_state.value_counts().sort_index()
print("Total Number of States:",addr_state_total.nunique())
print("Total Number of States Where Loans Were Charged Off:",addr_state_npaid.nunique())
Total Number of States: 47 Total Number of States Where Loans Were Charged Off: 43
We can see that there are 4 states that do not have charged off loans. We will remove those states to calculate the percentage of charged off loans in the other states.
#Create 2 dataframes, one for total, one for charged off.
#Merge them to get the common state records
st_df = pd.DataFrame({'state':addr_state_total.index, 'loan_count':addr_state_total.values})
st_npaid_df = pd.DataFrame({'state':addr_state_npaid.index, 'loan_npaid_count':addr_state_npaid.values})
merged_st_df = pd.merge(st_df, st_npaid_df, on='state', how='inner')
merged_st_df.describe()
| loan_count | loan_npaid_count | |
|---|---|---|
| count | 45.000000 | 45.000000 |
| mean | 788.355556 | 111.533333 |
| std | 1124.925536 | 169.040958 |
| min | 10.000000 | 1.000000 |
| 25% | 163.000000 | 24.000000 |
| 50% | 399.000000 | 60.000000 |
| 75% | 955.000000 | 139.000000 |
| max | 6419.000000 | 993.000000 |
merged_st_df = merged_st_df.nlargest(35, 'loan_count')
# Plot a Graph to Display Percentage of Charged Off Loans Across States
addr_state_npaid_per = ( merged_st_df.loan_npaid_count / merged_st_df.loan_count ) *100
fig = go.Figure()
fig.add_trace(go.Bar(
x=merged_st_df.state,
y=addr_state_npaid_per,
name='State',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Charged off Loans Across States',
xaxis_title='State',
yaxis_title='Percentage of Charged Off Loans'
)
fig.show()
# Plot Graph to Display Spread of DTI Over Loan Status, understand outliers
fig = go.Figure(data=go.Box(x=loan.loan_status, y=loan.dti))
fig.update_layout(title='Spread of DTI Across Various Loan Status',
yaxis_title='DTI',
xaxis_title='Loan Status',
height = 1000,
width = 700
)
fig.show()
# Plot a Histogram to Display Spread of DTI Over Loan Status
fig = go.Figure(data=go.Histogram(x=loan.dti))
fig.update_layout(title='Distribution of DTI',
xaxis_title='DTI',
yaxis_title='Frequency',
height = 700,
xaxis=dict(showgrid=True),
yaxis=dict(showgrid=True))
fig.show()
# Plot a Graph to Display Percentage of Charged Off Loans Over DTI.
# Since DTI is a Numeric Variable, we will bucket it first.
loan['dti_bin'] = round(loan.dti)
loan_npaid = loan[loan.loan_status =='Charged Off']
dti_npaid = loan_npaid.dti_bin.value_counts().sort_index()
dti_total = loan.dti_bin.value_counts().sort_index()
dti_npaid_per = (dti_npaid.values / dti_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=dti_total.index,
y=dti_npaid_per,
name='DTI',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Debtors Over DTI',
xaxis_title='DTI',
yaxis_title='Percentage of Charged Off Loans',
)
fig.show()
loan.pub_rec_bankruptcies.value_counts()
0.0 34006 1.0 1471 2.0 5 Name: pub_rec_bankruptcies, dtype: int64
loan['pub_rec_bankruptcies_str'] = loan.pub_rec_bankruptcies.astype('str')
loan_npaid = loan[loan.loan_status =='Charged Off']
# Plot a Graph to Display the Number of Loans Taken if Bankruptcy Record Exists
pub_rec_bankruptcies_total = (loan.pub_rec_bankruptcies_str.value_counts()).sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(
x=pub_rec_bankruptcies_total.index,
y=pub_rec_bankruptcies_total.values,
name='Record of Bankruptcies',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Number of Loans Taken if Bankruptcy Record Exists',
xaxis_title='Record of Bankruptcies',
yaxis_title='Number of Loans'
)
fig.show()
# Plot a Graph to Display the Percentage of Charged Off Loans if Bankruptcy Record Exists
pub_rec_bankruptcies_npaid = loan_npaid.pub_rec_bankruptcies_str.value_counts().sort_index()
pub_rec_bankruptcies_npaid_per = (pub_rec_bankruptcies_npaid.values / pub_rec_bankruptcies_total.values)*100
fig = go.Figure()
fig.add_trace(go.Bar(
x=pub_rec_bankruptcies_total.index,
y=pub_rec_bankruptcies_npaid_per,
name='Charged Off Loans',
marker=dict(color='Blue')
))
bar_width = 0.2
fig.update_traces(width=bar_width)
fig.update_layout(
title='Variation in Charged off Loans if Bankruptcy Record Exists',
xaxis_title='Bankruptcy Record Exists',
yaxis_title='Percentage of Charged Off Loans'
)
fig.show()
res = loan_npaid.corr(numeric_only = True)
res.index = res.index.astype(str)
fig = go.Figure(data=go.Heatmap(x=res.columns, y=res.index, z=res.values, colorscale='Blues'))
fig.update_layout(
title='Correlation between different numeric values')
# Display the chart
fig.show()
# Plot Graph to Display Spread of Loan Amount Over Loan Term
fig = go.Figure(data=go.Box(x=loan_npaid.term, y=loan_npaid.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Across Various Loan Terms',
yaxis_title='Loan Amount',
xaxis=dict(title='Loan Term', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan.groupby(['term', 'loan_amnt_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin", columns="term", values="npaid_ratio")
fig = go.Figure()
fig.add_trace(go.Scatter(
x=loan_amnt_cat_arr,
y=pvt[' 36 months'],
mode='lines',
name='36 months')
)
fig.add_trace(go.Scatter(
x=loan_amnt_cat_arr,
y=pvt[' 60 months'],
mode='lines',
name='60 months')
)
fig.update_layout(
title='Spread of Charged Off Loan Amounts Over Loan Term',
xaxis=dict(title='Loan Amount',categoryorder='array', categoryarray=loan_amnt_cat_arr),
yaxis=dict(title='Percentage of Charged Off Loans')
)
fig.show()
# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=loan_npaid.int_rate, y=loan_npaid.loan_amnt, mode='markers'))
# Update layout
fig.update_layout(
title='Spread of Loan Amount Over Interest Rate',
xaxis=dict(title='Interest Rate'),
yaxis=dict(title='Loan Amount')
)
# Show the plot
fig.show()
loan_count = loan.groupby(['loan_amnt_bin_str', 'int_rate_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin_str", columns="int_rate_bin", values="npaid_ratio")
row_labels = pvt.index.get_level_values("loan_amnt_bin_str")
column_labels = pvt.columns.get_level_values("int_rate_bin")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loan Amounts Over Interest Rate",
xaxis=dict(title="Interest Rate"),
yaxis=dict(title='Percentage of Charged Off Loan Amounts',categoryorder='array', categoryarray=loan_amnt_cat_arr)
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=loan_npaid.installment, y=loan_npaid.loan_amnt, mode='markers'))
# Update layout
fig.update_layout(
title='Distribution of Loan Amount Vs Installment Amounts',
xaxis=dict(title='Installment Amount'),
yaxis=dict(title='Loan Amount')
)
# Show the plot
fig.show()
Larger loan amounts have larger installment amounts, highly correlated. So installment amount can be left out for further analysis.
fig = go.Figure(data=go.Box(x=loan_npaid.grade, y=loan_npaid.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Across Various Grades',
yaxis_title='Loan Amount',
xaxis=dict(title='Loan Grade', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan.groupby(['loan_amnt_bin_str', 'grade', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin_str", columns="grade", values="npaid_ratio")
row_labels = pvt.index.get_level_values("loan_amnt_bin_str")
column_labels = pvt.columns.get_level_values("grade")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loan Amounts Over Loan Grades",
xaxis=dict(title="Grades"),
yaxis=dict(title='Percentage of Charged Off Loan Amounts',categoryorder='array', categoryarray=loan_amnt_cat_arr)
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
Higher loan amounts at e,f,g charge-off more. Confirms univariate for both the variables
fig = go.Figure(data=go.Box(x=loan_npaid.sub_grade, y=loan_npaid.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Across Various Sub Grades',
yaxis_title='Loan Amount',
xaxis=dict(title='Loan Sub Grade', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan.groupby(['loan_amnt_bin_str', 'sub_grade', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin_str", columns="sub_grade", values="npaid_ratio")
row_labels = pvt.index.get_level_values("loan_amnt_bin_str")
column_labels = pvt.columns.get_level_values("sub_grade")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loan Amounts Over Loan Sub Grades",
xaxis=dict(title="Sub Grades"),
yaxis=dict(title='Percentage of Charged Off Loan Amounts',categoryorder='array', categoryarray=loan_amnt_cat_arr)
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
fig = go.Figure(data=go.Box(x=loan_npaid.home_ownership, y=loan_npaid.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Across Various Home Ownership Categories',
yaxis_title='Loan Amount',
xaxis=dict(title='Home Ownership Category', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan.groupby(['home_ownership', 'loan_amnt_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin", columns="home_ownership", values="npaid_ratio")
fig = go.Figure()
fig.add_trace(go.Scatter(
x=loan_amnt_cat_arr,
y=pvt['MORTGAGE'],
mode='lines',
name='Mortgage')
)
fig.add_trace(go.Scatter(
x=loan_amnt_cat_arr,
y=pvt['OWN'],
mode='lines',
name='Own')
)
fig.add_trace(go.Scatter(
x=loan_amnt_cat_arr,
y=pvt['RENT'],
mode='lines',
name='Rent')
)
fig.update_layout(
title='Spread of Charged Off Loan Amounts Over Home Onwership category',
xaxis=dict(title='Loan Amount',categoryorder='array', categoryarray=loan_amnt_cat_arr),
yaxis=dict(title='Percentage of Charged Off Loans')
)
fig.show()
# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=loan_npaid.annual_inc[loan_npaid.annual_inc < 140000],
y=loan_npaid.loan_amnt, mode='markers'))
# Update layout
fig.update_layout(
title='Distribution of Loan Amount Vs Annual Income',
xaxis=dict(title='Annual Income'),
yaxis=dict(title='Loan Amount')
)
# Show the plot
fig.show()
loan_annual_inc = loan[loan.annual_inc <= 140000]
loan_count = loan_annual_inc.groupby(['loan_amnt_bin_str', 'annual_inc_bin_str', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin_str", columns="annual_inc_bin_str", values="npaid_ratio")
row_labels = pvt.index.get_level_values("loan_amnt_bin_str")
column_labels = pvt.columns.get_level_values("annual_inc_bin_str")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loan Amounts Over Annual Income",
xaxis=dict(title="Annual Income",categoryorder='array', categoryarray=annual_inc_cat_arr),
yaxis=dict(title='Percentage of Charged Off Loan Amounts',categoryorder='array', categoryarray=loan_amnt_cat_arr)
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
People with annual income from 20k to 60k taking loan more than 15k are charging off more.
fig = go.Figure(data=go.Box(x=loan.issue_month, y=loan.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Across All Months',
yaxis_title='Loan Amount',
xaxis=dict(title='Months', categoryorder='array', categoryarray=issue_month_cat_arr),
height = 1000,
width = 700
)
fig.show()
fig = go.Figure(data=go.Box(x=loan_npaid.purpose, y=loan_npaid.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Over Loan Purpose',
yaxis_title='Loan Amount',
xaxis=dict(title='Loan Purpose', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan.groupby(['loan_amnt_bin_str', 'purpose', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin_str", columns="purpose", values="npaid_ratio")
# Extract the row and column labels
row_labels = pvt.index.get_level_values("loan_amnt_bin_str")
column_labels = pvt.columns.get_level_values("purpose")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues' # You can choose any colorscale you prefer
)
# Create the layout
layout = go.Layout(
title="Spread of Charged Off Loan Amounts Over Loan Purpose",
xaxis=dict(title="Purpose"),
yaxis=dict(title="Percentage of Charged Off Loans",categoryorder='array', categoryarray=loan_amnt_cat_arr)
)
# Create the figure
fig = go.Figure(data=[heatmap], layout=layout)
# Display the plot
fig.show()
fig = go.Figure(data=go.Box(x=loan.addr_state, y=loan.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Across States',
yaxis_title='Loan Amount',
xaxis=dict(title='States', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_state = loan[loan['addr_state'].isin(merged_st_df['state'])]
loan_count = loan_state.groupby(['loan_amnt_bin_str', 'addr_state', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin_str", columns="addr_state", values="npaid_ratio")
# Extract the row and column labels
row_labels = pvt.index.get_level_values("loan_amnt_bin_str")
column_labels = pvt.columns.get_level_values("addr_state")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues' # You can choose any colorscale you prefer
)
# Create the layout
layout = go.Layout(
title="Spread of Charged Off Loan Amounts Across States",
xaxis=dict(title="State"),
yaxis=dict(title="Percentage of Charged Off Loans",categoryorder='array', categoryarray=loan_amnt_cat_arr)
)
# Create the figure
fig = go.Figure(data=[heatmap], layout=layout)
# Display the plot
fig.show()
# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=loan.dti, y=loan.loan_amnt, mode='markers'))
# Update layout
fig.update_layout(
title='Distribution of Loan Amount Across DTI',
xaxis=dict(title='DTI'),
yaxis=dict(title='Loan Amount')
)
# Show the plot
fig.show()
loan_count = loan.groupby(['loan_amnt_bin_str', 'dti_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin_str", columns="dti_bin", values="npaid_ratio")
row_labels = pvt.index.get_level_values("loan_amnt_bin_str")
column_labels = pvt.columns.get_level_values("dti_bin")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loan Amounts Over DTI",
xaxis=dict(title="DTI"),
yaxis=dict(title='Percentage of Charged Off Loan Amounts',categoryorder='array', categoryarray=loan_amnt_cat_arr)
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
fig = go.Figure(data=go.Box(x=loan.pub_rec_bankruptcies, y=loan.loan_amnt))
fig.update_layout(title='Spread of Loan Amount Across States',
yaxis_title='Loan Amount',
xaxis=dict(title='Bankruptcies', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan.groupby(['pub_rec_bankruptcies_str', 'loan_amnt_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="loan_amnt_bin", columns="pub_rec_bankruptcies_str", values="npaid_ratio")
fig = go.Figure()
fig.add_trace(go.Scatter(
x=loan_amnt_cat_arr,
y=pvt['0.0'],
mode='lines',
name='0')
)
fig.add_trace(go.Scatter(
x=loan_amnt_cat_arr,
y=pvt['1.0'],
mode='lines',
name='1')
)
fig.add_trace(go.Scatter(
x=loan_amnt_cat_arr,
y=pvt['2.0'],
mode='lines',
name='2')
)
fig.update_layout(
title='Spread of Charged Off Loan Amounts Over Bankruptcy Record',
xaxis=dict(title='Loan Amount',categoryorder='array', categoryarray=loan_amnt_cat_arr),
yaxis=dict(title='Percentage of Charged Off Loans')
)
fig.show()
fig = go.Figure(data=go.Violin(x=loan.term, y=loan.int_rate, box_visible=True, meanline_visible=True))
fig.update_layout(title='Spread of Interest Rates For Loan Terms',
yaxis_title='Interest Rate',
xaxis=dict(title='Loan Term', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan.groupby(['term', 'int_rate_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="int_rate_bin", columns="term", values="npaid_ratio")
fig = go.Figure()
fig.add_trace(go.Scatter(
x=pvt.index,
y=pvt[' 36 months'],
mode='lines',
name='36 months')
)
fig.add_trace(go.Scatter(
x=pvt.index,
y=pvt[' 60 months'],
mode='lines',
name='60 months')
)
fig.update_layout(
title='Spread of Interest Rates Over Loan Term',
xaxis=dict(title='Interest Rate'),
yaxis=dict(title='Percentage of Charged Off Loans')
)
fig.show()
loan_count = loan.groupby(['sub_grade', 'term', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="sub_grade", columns="term", values="npaid_ratio")
# Plot a graph
fig = go.Figure()
row_labels = pvt.index.get_level_values("sub_grade")
column_labels = pvt.columns.get_level_values("term")
z=pvt.values
fig.add_trace(go.Bar(
x=pvt.index,
y=pvt[' 36 months'],
name='36 months',
offset=-0.2,
marker=dict(color='Blue')
))
fig.add_trace(go.Bar(
x=pvt.index,
y=pvt[' 60 months'],
name='60 months',
offset=0.2,
marker=dict(color='Green')
))
bar_width = 0.4
fig.update_traces(width=bar_width)
fig.update_layout(
title='Loan Sub-Grades In Each Term',
xaxis_title='Sub-Grade',
yaxis_title='Ratio of Charged Off Loans',
barmode='group'
)
fig.show()
loan.int_rate[(loan.term == ' 36 months') & (loan.sub_grade.isin(['G2','G4''G5'])) & (loan.loan_status == 'Charged Off')]
13994 22.48 17986 20.48 22900 20.03 35072 20.17 36163 19.16 Name: int_rate, dtype: float64
loan_home = loan[loan.home_ownership != 'OTHER']
loan_count = loan_home.groupby(['home_ownership', 'term', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="home_ownership", columns="term", values="npaid_ratio")
# Plot a graph
fig = go.Figure()
row_labels = pvt.index.get_level_values("home_ownership")
column_labels = pvt.columns.get_level_values("term")
z=pvt.values
fig.add_trace(go.Bar(
x=pvt.index,
y=pvt[' 36 months'],
name='36 months',
offset=-0.2,
marker=dict(color='Blue')
))
fig.add_trace(go.Bar(
x=pvt.index,
y=pvt[' 60 months'],
name='60 months',
offset=0.2,
marker=dict(color='Green')
))
bar_width = 0.4
fig.update_traces(width=bar_width)
fig.update_layout(
title='Home Ownership In Each Term',
xaxis_title='Home Ownership',
yaxis_title='Ratio of Charged Off Loans',
barmode='group'
)
fig.show()
fig = go.Figure(data=go.Violin(x=loan.term, y=loan.annual_inc[loan.annual_inc < 140000], box_visible=True, meanline_visible=True))
fig.update_layout(title='Spread of Annual Income Over Loan Term',
yaxis_title='Annual Income',
xaxis=dict(title='Loan Term', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan_annual_inc.groupby(['term', 'annual_inc_bin_str', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="annual_inc_bin_str", columns="term", values="npaid_ratio")
fig = go.Figure()
fig.add_trace(go.Scatter(
x=annual_inc_cat_arr,
y=pvt[' 36 months'],
mode='lines',
name='36 months')
)
fig.add_trace(go.Scatter(
x=annual_inc_cat_arr,
y=pvt[' 60 months'],
mode='lines',
name='60 months')
)
fig.update_layout(
title='Spread of Annual Income Over Loan Term',
xaxis=dict(title='Annual Income', categoryorder='array', categoryarray=annual_inc_cat_arr),
yaxis=dict(title='Percentage of Charged Off Loans')
)
fig.show()
loan_count = loan.groupby(['purpose', 'term', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="purpose", columns="term", values="npaid_ratio")
# Plot a graph
fig = go.Figure()
row_labels = pvt.index.get_level_values("purpose")
column_labels = pvt.columns.get_level_values("term")
z=pvt.values
fig.add_trace(go.Bar(
x=pvt.index,
y=pvt[' 36 months'],
name='36 months',
offset=-0.2,
marker=dict(color='Blue')
))
fig.add_trace(go.Bar(
x=pvt.index,
y=pvt[' 60 months'],
name='60 months',
offset=0.2,
marker=dict(color='Green')
))
bar_width = 0.4
fig.update_traces(width=bar_width)
fig.update_layout(
title='Purpose Vs Term',
xaxis_title='Purpose',
yaxis_title='Percentage of Charged Off Loans',
barmode='group'
)
fig.show()
fig = go.Figure(data=go.Violin(x=loan.term, y=loan.dti, box_visible=True, meanline_visible=True))
fig.update_layout(title='Spread of DTI Over Loan Term',
yaxis_title='DTI',
xaxis=dict(title='Loan Term', categoryorder='category ascending'),
height = 1000,
width = 700
)
fig.show()
loan_count = loan.groupby(['term', 'dti_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="dti_bin", columns="term", values="npaid_ratio")
fig = go.Figure()
fig.add_trace(go.Scatter(
x=pvt.index,
y=pvt[' 36 months'],
mode='lines',
name='36 months')
)
fig.add_trace(go.Scatter(
x=pvt.index,
y=pvt[' 60 months'],
mode='lines',
name='60 months')
)
fig.update_layout(
title='Spread of DTI Over Loan Term',
xaxis=dict(title='DTI'),
yaxis=dict(title='Percentage of Charged Off Loans')
)
fig.show()
# See if interest rate and sub grade are related (for all loans, not just for charged off loans)
fig = go.Figure(data=go.Violin(x=loan.sub_grade, y=loan.int_rate, box_visible=True, meanline_visible=True))
fig.update_layout(title='Spread of Interest Rate Over Sub Grades',
yaxis_title='Interest Rate',
xaxis=dict(title='Sub Grade', categoryorder='category ascending'),
height = 1000,
width = 1000
)
fig.show()
loan_count = loan.groupby(['int_rate_bin', 'sub_grade', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="int_rate_bin", columns="sub_grade", values="npaid_ratio")
row_labels = pvt.index.get_level_values("int_rate_bin")
column_labels = pvt.columns.get_level_values("sub_grade")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loans at Various Interest Rates Over Loan Sub Grades",
xaxis=dict(title="Sub Grades"),
yaxis=dict(title='Percentage of Charged Off Loans at Various Interest Rates')
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=loan.annual_inc[loan.annual_inc<140000], y=loan.int_rate, mode='markers'))
# Update layout
fig.update_layout(
title='Distribution of Interest Rate Over Annual Income',
xaxis=dict(title='Annual Income'),
yaxis=dict(title='Interest Rate')
)
# Show the plot
fig.show()
loan_count = loan_annual_inc.groupby(['annual_inc_bin_str', 'int_rate_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="annual_inc_bin_str", columns="int_rate_bin", values="npaid_ratio")
row_labels = pvt.index.get_level_values("annual_inc_bin_str")
column_labels = pvt.columns.get_level_values("int_rate_bin")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loan Amounts in Various Annual Income Brackets Over Interest Rate",
xaxis=dict(title="Interest Rate"),
yaxis=dict(title='Percentage of Charged Off Loan Amounts in Annual Income Brackets',categoryorder='array', categoryarray=annual_inc_cat_arr)
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
fig = go.Figure(data=go.Violin(x=loan.purpose, y=loan.int_rate, box_visible=True, meanline_visible=True))
fig.update_layout(title='Impact of Interest Rate Over Loan Purpose',
yaxis_title='Interest Rate',
xaxis=dict(title='Loan Purpose', categoryorder='category ascending'),
height = 1000,
width = 1000
)
fig.show()
loan_count = loan.groupby(['int_rate_bin', 'purpose', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="int_rate_bin", columns="purpose", values="npaid_ratio")
row_labels = pvt.index.get_level_values("int_rate_bin")
column_labels = pvt.columns.get_level_values("purpose")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loans at Various Interest Rates Over Loan Purpose",
xaxis=dict(title="Purpose"),
yaxis=dict(title='Percentage of Charged Off Loans at Various Interest Rates')
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=loan.dti, y=loan.int_rate, mode='markers'))
# Update layout
fig.update_layout(
title='Distribution of Interest Rate Over DTI',
xaxis=dict(title='DTI'),
yaxis=dict(title='Interest Rate')
)
# Show the plot
fig.show()
loan_count = loan_annual_inc.groupby(['int_rate_bin', 'dti_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="int_rate_bin", columns="dti_bin", values="npaid_ratio")
row_labels = pvt.index.get_level_values("int_rate_bin")
column_labels = pvt.columns.get_level_values("dti_bin")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loan Amounts at Various Interest Rates Over DTI",
xaxis=dict(title="DTI"),
yaxis=dict(title='Percentage of Charged Off Loan Amounts at Various Interest Rates', categoryorder="category ascending")
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
fig = go.Figure(data=go.Violin(x=loan.pub_rec_bankruptcies_str, y=loan.int_rate, box_visible=True, meanline_visible=True))
fig.update_layout(title='Impact of Interest Rate Over Bankruptcies Record',
yaxis_title='Interest Rate',
xaxis=dict(title='Bankruptcies Record', categoryorder='category ascending'),
height = 1000,
width = 1000
)
fig.show()
fig = go.Figure(data=go.Violin(x=loan.purpose, y=loan.annual_inc[loan.annual_inc<140000], box_visible=True, meanline_visible=True))
fig.update_layout(title='Impact of Annual Income on Loan Purpose',
yaxis_title='Annual Income',
xaxis=dict(title='Loan Purpose', categoryorder='category ascending'),
height = 1000,
width = 1000
)
fig.show()
loan_count = loan_annual_inc.groupby(['annual_inc_bin_str', 'purpose', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="annual_inc_bin_str", columns="purpose", values="npaid_ratio")
row_labels = pvt.index.get_level_values("annual_inc_bin_str")
column_labels = pvt.columns.get_level_values("purpose")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loans at Various Income Brackets Over Loan Purpose",
xaxis=dict(title="Purpose"),
yaxis=dict(title='Percentage of Charged Off Loans at Various Income Brackets', categoryorder='array', categoryarray=annual_inc_cat_arr)
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
# Create a scatter plot
fig = go.Figure(data=go.Scatter(x=loan.annual_inc[loan.annual_inc<140000], y=loan.dti, mode='markers'))
# Update layout
fig.update_layout(
title='Distribution of Annual Income Over DTI',
xaxis=dict(title='Annual Income'),
yaxis=dict(title='DTI')
)
# Show the plot
fig.show()
loan_count = loan_annual_inc.groupby(['annual_inc_bin_str', 'dti_bin', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="annual_inc_bin_str", columns="dti_bin", values="npaid_ratio")
row_labels = pvt.index.get_level_values("annual_inc_bin_str")
column_labels = pvt.columns.get_level_values("dti_bin")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loan Amounts at Various Income Brackets Over DTI",
xaxis=dict(title="DTI"),
yaxis=dict(title='Percentage of Charged Off Loan Amounts at Various Income Brackets', categoryorder="array", categoryarray=annual_inc_cat_arr)
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
fig = go.Figure(data=go.Violin(x=loan.pub_rec_bankruptcies_str, y=loan.annual_inc[loan.annual_inc<140000], box_visible=True, meanline_visible=True))
fig.update_layout(title='Impact of Annual Income on Bankruptcies Records',
yaxis_title='Annual Income',
xaxis=dict(title='Banruptcies Records', categoryorder='category ascending'),
height = 1000,
width = 1000
)
fig.show()
fig = go.Figure(data=go.Violin(x=loan.purpose, y=loan.dti, box_visible=True, meanline_visible=True))
fig.update_layout(title='Impact of DTI on Loan Purpose',
yaxis_title='DTI',
xaxis=dict(title='Loan Purpose', categoryorder='category ascending'),
height = 1000,
width = 1000
)
fig.show()
loan_count = loan_annual_inc.groupby(['dti_bin', 'purpose', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="dti_bin", columns="purpose", values="npaid_ratio")
row_labels = pvt.index.get_level_values("dti_bin")
column_labels = pvt.columns.get_level_values("purpose")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loans for Various DTIs Over Loan Purpose",
xaxis=dict(title="Purpose"),
yaxis=dict(title='Percentage of Charged Off Loans for Various DTIs', categoryorder='category ascending')
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
loan_count = loan.groupby(['grade', 'purpose', 'loan_status']).size().unstack()
loan_count = loan_count.reset_index()
loan_count['npaid_ratio'] = round((loan_count['Charged Off'] / (loan_count['Charged Off'] + loan_count['Fully Paid']))*100,2)
pvt = pd.pivot_table(data=loan_count, index="grade", columns="purpose", values="npaid_ratio")
row_labels = pvt.index.get_level_values("grade")
column_labels = pvt.columns.get_level_values("purpose")
# Create the heatmap trace
heatmap = go.Heatmap(
x=column_labels,
y=row_labels,
z=pvt.values,
colorscale='Blues'
)
layout = go.Layout(
title="Spread of Charged Off Loans for Various Grades Over Loan Purpose",
xaxis=dict(title="Purpose"),
yaxis=dict(title='Percentage of Charged Off Loans for Various Grades', categoryorder='category ascending')
)
fig = go.Figure(data=[heatmap], layout=layout)
fig.show()
loan_an = loan_npaid[loan_npaid.annual_inc < 140000]
sns.pairplot(loan_an[['loan_amnt','term','int_rate','dti','annual_inc']])
plt.show()